Skip to content

Scoring tests separately#350

Open
seroperson wants to merge 24 commits intoentrius:testfrom
seroperson:i339-tests-separate-scoring
Open

Scoring tests separately#350
seroperson wants to merge 24 commits intoentrius:testfrom
seroperson:i339-tests-separate-scoring

Conversation

@seroperson
Copy link
Copy Markdown

Closes #339

This PR addresses the issue that test code affect scoring process.

Before

  • Tests + source + non-code changes are counted for contribution bonus and also counted in density calculation.
  • Adding tests significantly reduce your score, because they give almost no score, but still affect density.
  • Adding non-code changes also affect the source density and may lower the score. The situation is better than with tests, as recognizable file (via extension) at least contribute some score. But unrecognized files still always lower the score, as they give zero score and affect density.
  • Situation is the same with deleted and binary files: they contribute no score, but still affect the density.

After

  • Test + source + non-code changes has its' own density and calculate its' own score.
  • Only source changes are counted for contribution bonus.
  • After each category is scored calculated, they are summed.

Implementation details

  • Added ScoringCategory enum. Possible values are TEST (if it's a test file), SOURCE (if scoring method is tree_sitter and it's not a test file), NON_CODE (everything else: recognized non-code changes, unrecognized non-code changes, deleted files, binary files)
  • Added PrScoringResultCategorized, which holds PrScoringResult per each existing category.
  • Scoring logic moved from scoring.py into PrScoringResultCategorized methods.

Test cases

Test files:

  • test_adding_tests_does_not_reduce_score - Adding test files to a source PR never lowers the base score
  • test_tests_do_not_affect_contribution_bonus - Small and large test files produce modest, similar increases (test weight is 0.05x)
  • test_same_code_in_test_path_scores_much_lower - Identical code in a test directory scores 10x+ lower than in a source path
  • test_tests_do_not_affect_threshold - Test files can't push a below-threshold PR past the token score threshold

Non-code files:

  • test_adding_non_code_files_does_not_reduce_score - Adding non-code files (markdown, yaml) never lowers the base score
  • test_non_code_does_not_affect_contribution_bonus - Small and large non-code files produce the same density increase (no bonus impact)
  • test_source_code_scores_much_higher_than_non_code - Tree-diff scored source code scores 10x+ higher than line-count scored non-code files
  • test_non_code_does_not_affect_threshold - Non-code files can't push a below-threshold PR past the token score threshold

Zero-score files:

  • test_deleted_file_does_not_change_score - Deleted files (score=0) don't affect the base score
  • test_unsupported_file_does_not_change_score - Unsupported extensions (score=0) don't affect the base score

Density:

  • test_adding_test_category_increases_score_beyond_single_cap - Per-category density cap allows multiple categories to contribute independently
  • test_verbose_formatting_decreases_score - Same logic in more lines produces lower density and lower score
  • test_modified_file_scores_diff_only - Modified files score only the AST diff, not the entire file

Threshold:

  • test_below_threshold_scores_less - Trivial changes below token score threshold score less than substantial changes

anderdc and others added 14 commits March 20, 2026 14:34
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Zanie Blue <contact@zanie.dev>
…ate scoring (entrius#314)

Co-authored-by: root <root@135-181-76-236.ptr>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: anderdc <me@alexanderdc.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@anderdc
Copy link
Copy Markdown
Collaborator

anderdc commented Apr 7, 2026

just merged big changes into test please fix conflicts thanks

bittoby and others added 4 commits April 6, 2026 22:39
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
…res (entrius#346)

Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
@seroperson seroperson force-pushed the i339-tests-separate-scoring branch 2 times, most recently from d9ff4cd to d8a9df5 Compare April 7, 2026 08:50
@seroperson
Copy link
Copy Markdown
Author

@anderdc I've rebased it on current test 👍

tmimmanuel and others added 4 commits April 9, 2026 10:15
…ount (entrius#340)

Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
…s#351)

Co-authored-by: Ander <61125407+anderdc@users.noreply.github.com>
Co-authored-by: anderdc <me@alexanderdc.com>
@seroperson seroperson force-pushed the i339-tests-separate-scoring branch from d8a9df5 to 4ade5b6 Compare April 9, 2026 23:23
@anderdc
Copy link
Copy Markdown
Collaborator

anderdc commented Apr 10, 2026

I have a PR going into test now that will fix that pyright type check CI, don't worry about it

Copy link
Copy Markdown
Collaborator

@anderdc anderdc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Good problem identification — test and non-code files dragging down source scores via density is a real issue and the per-category approach is the right direction. The ScoringCategory enum and the category property on FileScoreResult are clean additions. The test suite is thorough and covers the right invariants.

However, the implementation adds more scaffolding than necessary and introduces a redundant iteration. Requesting changes on the following:

1. Don't loop twice over file results

from_file_results() re-iterates all file_results to categorize and re-sum totals that the existing loop in tree_sitter_scoring.py already computes. The category is trivially known at each append site (is_test_file + scoring_method), so just accumulate into per-category dicts inline during the existing loop. No second pass needed.

2. Extend PrScoringResult instead of adding PrScoringResultCategorized

The new wrapper class duplicates fields from PrScoringResult (total_score, total_nodes_scored, score_breakdown) and adds indirection. Since PrScoringResult is only constructed in one place and consumed in one place, just add by_category: Dict[ScoringCategory, PrScoringResult] and total_lines: int to the existing class. No need for a parallel class hierarchy or the _EMPTY_SCORING_RESULT sentinel.

3. Keep scoring logic in scoring.py

calculate_initial_base_score() and calculate_contribution_bonus() are business logic that reference scoring constants (MERGED_PR_BASE_SCORE, MAX_CONTRIBUTION_BONUS, CONTRIBUTION_SCORE_FOR_FULL_BONUS). This logic currently lives in scoring.py alongside the rest of the scoring decisions. Moving it into a dataclass in classes.py scatters scoring logic across two files. Keep it in scoring.py — the caller can iterate by_category directly with a few lines.

4. Threshold check may not isolate categories correctly

In calculate_initial_base_score(), the threshold uses self.score_breakdown.total_score — the aggregate across all categories including test and non-code. A PR with trivial source but a large test suite could pass the threshold via test score alone. The existing tests for this (test_tests_do_not_affect_threshold) pass token_score=0.0 explicitly on the factory, so they don't catch this path. The intent is that only SOURCE contributions matter for the threshold, the check should use the source category's score, not the aggregate.

5. Preserve diagnostic detail in log output

The current log shows density and bonus percentage:

Base score: 15.00 (density 0.50) + 3.0 bonus (10% of max 30) = 18.00

The PR reduces this to:

Base score: 15.00 + 3.0 bonus = 18.00

With per-category scoring, the density breakdown is actually more useful to log, not less. Please preserve or improve the diagnostic output.

@anderdc anderdc force-pushed the test branch 2 times, most recently from 8adcdf6 to 9c9860f Compare April 10, 2026 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tests affect code density and reduce resulting score very much